
[WIP] Phrases: make any2utf8 optional #1413

Closed · wants to merge 41 commits

Conversation

**@prakhar2b** (Contributor) commented Jun 14, 2017

Updated benchmark (for text8)

| Optimization | Python 2.7 | Python 3.6 |
|---|---|---|
| original | ~36-38 sec | ~32-34 sec |
| `recode_to_utf8=False` | ~19-21 sec | ~20-22 sec |

Speed improvement in Phrases module.
Phrases optimization benchmark (for text8)


Note: leaving this here for the context of prior discussions.

| Optimization | Python 2.7 | Python 3.6 | PR |
|---|---|---|---|
| original | ~36-38 sec | ~32-35 sec | |
| cython (static typing) | ~30-32 sec | | |
| any2utf8 (without cython) | ~20-22 sec | ~23-26 sec | This PR |
| cython (with any2utf8) | ~15-18 sec | ~19-21 sec | #1385 |

@prakhar2b prakhar2b changed the title Phrases: convert tokens into utf8 (any2utf8) only before save [WIP] Phrases: convert tokens into utf8 (any2utf8) only before save Jun 14, 2017
**@piskvorky** (Owner)

What is the memory impact of this change? The conversion was there for a reason, IIRC.

**@prakhar2b** (Contributor, Author)

@piskvorky yes, we had a discussion here regarding this. This PR will be updated accordingly asap.

@prakhar2b prakhar2b changed the title [WIP] Phrases: convert tokens into utf8 (any2utf8) only before save [WIP] Phrases: apply any2utf8 on entire sentence instead of each word separately Jun 21, 2017
```diff
@@ -169,7 +169,9 @@ def learn_vocab(sentences, max_vocab_size, delimiter=b'_', progress_per=10000):
     if sentence_no % progress_per == 0:
         logger.info("PROGRESS: at sentence #%i, processed %i words and %i word types" %
                     (sentence_no, total_words, len(vocab)))
-    sentence = [utils.any2utf8(w) for w in sentence]
+    sentence = [w for w in (utils.any2utf8(u'_'.join(sentence)).split('_'))]
```
Review comment (Contributor):

A few issues here:

1. You are trying to split a bytestring (the result of the `any2utf8` call) by `'_'`. This will not work on Python 3, because literal strings are unicode by default. You've faced similar problems previously, so I think it would be helpful to understand character encodings at a conceptual level, and the differences between string handling in Python 2 and 3.

2. Simply `sentence = utils.any2utf8(u'_'.join(sentence)).split('_')` would be enough; no need for the extra `[w for w in ...]`.

3. We're not accounting for the possibility that a word in the sentence contains `'_'`. It would be wrong to make implicit assumptions like these about user input, unless there is an explicit constraint in the API. Escaping could be an option, although I'm not sure it is feasible performance-wise.
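The Python 2 vs 3 pitfall in the first point can be demonstrated in isolation (a minimal sketch, independent of gensim):

```python
# On Python 3, u'_'.join(...).encode(...) produces bytes; splitting those
# bytes requires a *bytes* delimiter, not a str literal.
joined = u'_'.join([u'foo', u'bar']).encode('utf8')  # b'foo_bar'

try:
    joined.split('_')  # str delimiter on a bytes object
    raised = False
except TypeError:      # raised on Python 3: a bytes-like object is required
    raised = True

tokens = joined.split(b'_')  # bytes delimiter works
print(raised, tokens)
```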

```python
# sentence = [utils.any2utf8(w) for w in sentence]
# Unicode tokens in dictionary (not utf8)

sentence = [w for w in (utils.any2utf8('_'.join(w for w in sentence)).split('_'))]
```
Review comment (Owner):
This should all be in the optimized path (C/Cython), so there's no point wasting time playing with Python calls.

The utf8 conversion overhead will be ~0, once this is optimized properly.

You can check with `cython -a` how much slow (Python) code is still left on the critical path. There should be none.

**@piskvorky** (Owner) commented Jun 23, 2017

This whole style of optimization is not desirable.

Optimize the code properly, by writing it in low-level C/Cython, not by shuffling Python calls around or joining characters in Python. That's not the way to gain a significant speed boost.

Use `cython -a` to verify there are no slow (Python) calls left on any critical code path. Critical blocks should look like C, using primitive C data structures, not Python.

**@tmylk** (Contributor) commented Jun 23, 2017

@piskvorky Some notes on the wider context of this work.

Unfortunately, rewriting code in C is outside the scope of the June evaluation milestone in Prakhar's GSoC proposal submitted in March. Another part of the proposal is selecting a multi-thread/multi-process architecture for Phrases; he is running experiments for it in the joblib PR.

Once proper benchmarks for the any2utf8 and Cython optimizations are submitted, these will be an improvement over the existing Phrases code. IMHO any improvement to the code is worth merging (unless it complicates things too much, which these changes don't).

These minor improvements to Phrases have been a good GSoC learning experience for Prakhar, in preparation for the FastText performance optimization that is the main focus of his GSoC project.

**@jayantj** (Contributor) commented Jun 23, 2017

I agree, even if in the future we follow a different approach, I think the current changes are worthwhile, as they do improve the times for Phrases significantly (of course, we need clearer benchmarks, but from initial results, it certainly seems so).

**@piskvorky** (Owner)

I disagree. This may be a good exercise for @prakhar2b , in preparation for fastText (like @tmylk says), but these utf8 changes obfuscate the code and are not the type of changes that the Phrases module needs.

Curious to see the benchmarks :)

```python
if sentence_no % progress_per == 0:
    logger.info("PROGRESS: at sentence #%i, processed %i words and %i word types" %
                (sentence_no, total_words, len(vocab)))
sentence = [utils.any2utf8(w) for w in sentence]
if isinstance(sentence[0], bytes):
```
Review comment (Owner):

What happens if sentence is empty?
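The concern can be reproduced directly; the fallback below is a hypothetical guard for illustration, not code from this PR:

```python
# Sniffing the token type via sentence[0] blows up on an empty sentence:
# the IndexError fires before isinstance ever runs.
def input_is_bytes(sentence):
    try:
        return isinstance(sentence[0], bytes)
    except IndexError:
        return False  # hypothetical default: an empty sentence tells us nothing

print(input_is_bytes([]))         # no uncaught IndexError with the guard
print(input_is_bytes([b'word']))
```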

```python
if sentence_no % progress_per == 0:
    logger.info("PROGRESS: at sentence #%i, processed %i words and %i word types" %
                (sentence_no, total_words, len(vocab)))
sentence = [utils.any2utf8(w) for w in sentence]
if isinstance(sentence[0], bytes):
    sentence = [w for w in (utils.any2utf8(b';'.join(sentence)).split(b';'))]
```
Review comment (Owner):

What happens if the sentence tokens contain `;`?
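The failure mode behind this question: if any token already contains the `;` delimiter, the join/split round-trip silently changes the tokenization (standalone sketch):

```python
sentence = [u'new;york', u'city']

# join on ';', recode to bytes, split on b';' -- the ';' inside 'new;york'
# is indistinguishable from the glue, so three tokens come back, not two
roundtrip = u';'.join(sentence).encode('utf8').split(b';')

print(roundtrip)
assert roundtrip != [w.encode('utf8') for w in sentence]
```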

@prakhar2b prakhar2b changed the title [WIP] Phrases: apply any2utf8 on entire sentence instead of each word separately [MRG] Phrases: apply any2utf8 on entire sentence instead of each word separately Jun 27, 2017
@prakhar2b prakhar2b changed the title [MRG] Phrases: apply any2utf8 on entire sentence instead of each word separately [WIP] Phrases: make any2utf8 tokenization optional Jun 27, 2017
```diff
@@ -133,19 +133,32 @@ def __init__(self, sentences=None, min_count=5, threshold=10.0,
     `delimiter` is the glue character used to join collocation tokens, and
     should be a byte string (e.g. b'_').

+    `recode_to_utf8` is an optional parameter (default True) for any2utf8 conversion of input sentences
```
**@piskvorky** (Owner) commented Jun 30, 2017:

How about:

> By default, the input sentences will be internally encoded to UTF-8 bytestrings, to save memory and ensure valid UTF-8. Set `recode_to_utf8=False` to skip this recoding step in case you don't care about memory or if your sentences are already bytestrings. This will result in much faster training (~2x faster).

```python
    """
    if min_count <= 0:
        raise ValueError("min_count should be at least 1")
        min_count = 1
```
Review comment (Owner):

Remove these two checks.

```python
self.min_count = min_count
self.threshold = threshold
self.max_vocab_size = max_vocab_size
self.vocab = defaultdict(int)  # mapping between utf8 token => its count
self.min_reduce = 1  # ignore any tokens with count smaller than this
self.delimiter = delimiter
self.is_bytes = True  # for storing encoding type in vocab for supporting both unicode and bytestring input
```
Review comment (Owner):

Comment hard to understand, and I think it's because the logic is not clear. What is it that is bytes in `is_bytes`?

Isn't it better to simply create a flag for whether the input sentences are bytes or not?

Then the comment becomes `# do the input sentences consist of bytestrings?`, which is clear.

Also, this check seems to belong in `learn_vocab`, not here.

**@prakhar2b** (Contributor, Author) commented Jun 30, 2017:

EDIT: ok, I think `self.is_input_bytes` should be fine.

Adding this comment (and I think it rightly belongs in `__init__`):

```python
self.is_bytes = True  # do the input sentences consist of bytestrings?

# With default (recode_to_utf8=True) we encode input sentences to utf8 bytestrings, but
# with recode_to_utf8=False, we retain encoding, so need to store this encoding
# information to later convert token inputs accordingly (in __getitem__ and export_phrases)
```

Review comment (Owner):

If it's rightly in `__init__`, then what happens if `sentences=None` and the user calls `learn_vocab` manually later?

Reply (Contributor, Author):

@piskvorky updated the PR

```diff
@@ -133,19 +133,24 @@ def __init__(self, sentences=None, min_count=5, threshold=10.0,
     `delimiter` is the glue character used to join collocation tokens, and
     should be a byte string (e.g. b'_').

+    `recode_to_utf8`- By default, the input sentences will be internally encoded to
+    UTF-8 bytestrings, to save memory and ensure valid UTF-8. Set recode_to_utf8=False
```
Review comment (Owner):

Best to render code as literal text (put `recode_to_utf8=False` in backticks).

```
`recode_to_utf8`- By default, the input sentences will be internally encoded to
UTF-8 bytestrings, to save memory and ensure valid UTF-8. Set recode_to_utf8=False
to skip this recoding step in case you don't care about memory or if your sentences
are already bytestrings. This will result in much faster training (~2x faster)
```
Review comment (Owner):

Missing full stop at the end of sentence.

```python
if self.recode_to_utf8:
    s = [utils.any2utf8(w) for w in sentence]
else:
    s = [utils.any2utf8(w) for w in sentence] if self.is_input_bytes else list(sentence)
```
Review comment (Owner):

Looks like a bug -- recode is False but recoding happens.

**@prakhar2b** (Contributor, Author) commented Jun 30, 2017:

@piskvorky is it true that only unicode strings are accepted (and not bytestrings) as input tokens in `__getitem__` or `export_phrases`, as mentioned here in phrases.py?

Reply (Contributor, Author):

@piskvorky this conversion was apparently meant to handle the mismatch between bytestring training input and unicode token input. Is this an unprecedented use case? Should we just raise an error (e.g. `TypeError`) for differently encoded training and token input?

Reply (Owner):

Sorry, I don't understand. If the user said "I don't want recoding", we shouldn't be recoding.

What is the motivation for this?

Reply (Contributor, Author):

Yes, we are not recoding while training Phrases.

For example, with `recode_to_utf8=False` and unicode input sentences (for training), we have unicode words in the vocab. Now if we provide bytestring tokens to `__getitem__` or `export_phrases`, this mismatch will be a problem here in phrases.py (or vice versa).

For no recoding at all, I think the user will have to use the same encoding for both training and phrase retrieval (`__getitem__` / `export_phrases`).

Review comment (Contributor):

I think we shouldn't be recoding implicitly in `__getitem__`. The only gain is in the case where `learn_vocab` receives bytestrings, `recode_to_utf8` is False, and `__getitem__` receives unicode. I don't think that justifies the dangerous, subtle errors it could cause.

Also, if we're going to implicitly handle different input formats for `learn_vocab` vs `__getitem__`, we're also going to have to take care of the delimiter here.

Reply (Contributor, Author):

I think the delimiter issue is sorted in `learn_vocab` here.

Just for information, what could those dangerous errors be? The idea is simply to convert the incoming token in `__getitem__` to match the encoding in the vocab.

**@jayantj** (Contributor) commented Jul 6, 2017:

As @piskvorky mentioned, suppose we had `recode_to_utf8=False`, `learn_vocab` was called with latin2 bytestrings, and latin2 bytestrings are then sent to `__getitem__`. We'd silently recode the latin2 bytestrings to utf8 in `__getitem__` and the lookup would fail, even though it should succeed.
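A byte-level sketch of that failure mode (not gensim's actual code path; `vocab` here is a plain dict standing in for the learned vocabulary):

```python
# vocab built from latin-2 (ISO-8859-2) bytestrings, with recode_to_utf8=False
word = u'číslo'
latin2_key = word.encode('iso-8859-2')
vocab = {latin2_key: 42}

# a silent utf8 recode in __getitem__ would produce different bytes ...
utf8_key = word.encode('utf8')
assert utf8_key != latin2_key

# ... so the lookup fails even though the user passed the "same" word
print(utf8_key in vocab, latin2_key in vocab)
```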

Reply (Contributor, Author):

Ok, yes, I understand. I think failing explicitly is the correct choice then (for a mismatch between training input and infer input when `recode_to_utf8=False`). What kind of error should we raise for this mismatch?


```python
    """Collect unigram/bigram counts from the `sentences` iterable."""
    if not self.recode_to_utf8 and sentences is not None:
        sentence = list(next(iter(sentences)))
```
Review comment (Contributor):

If I understand correctly, this will raise an exception if `sentences` is either an empty list or an empty generator.
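A standalone illustration; `first_or_none` is a hypothetical helper, not code from this PR:

```python
# next(iter(...)) raises StopIteration on an empty list or an exhausted
# generator, so peeking at the first sentence needs an explicit guard.
def first_or_none(sentences):
    try:
        return next(iter(sentences))
    except StopIteration:
        return None

print(first_or_none([]))                      # None: empty list
print(first_or_none(s for s in []))           # None: empty generator
print(first_or_none([[u'hello', u'world']]))  # first sentence
```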

Review comment (Contributor):

In that case, ideally a simple test should be catching this too.

```python
# currently on develop --- original code
from gensim.models import Phrases
bigram = Phrases(Text8Corpus(text8_file))
```
**@piskvorky** (Owner) commented Jun 30, 2017:

Better to convert the streamed iterable to an in-memory list (using `list()`); it's small enough. That way we don't have to iterate over the file on disk every time.

This will make the benchmark conclusions stronger (less noise and fewer delays from other, unrelated parts of the code, IO overhead etc.).

```python
try:
    sentence = list(next(iter(sentences)))
except:
    raise ValueError("Input can not be empty list or generator.")
```
**@piskvorky** (Owner) commented Jun 30, 2017:

-1: why should empty input suddenly become a special case, raising an exception?

**@prakhar2b** (Contributor, Author) commented Jul 1, 2017:

This concern was raised in the review comment above, and also as discussed on gitter: `sentences=[]` is not `None`, but `next(iter(sentences))` will still throw an error in the case above.

**@piskvorky** (Owner) commented Jul 1, 2017:

Yes, but that's a reason to fix the bug, not change the API :)

Reply (Contributor, Author):

Oh, I should have just logged a warning (`Empty sentences provided as input`) in the except block, with no need to raise the error.

**@piskvorky** (Owner) commented Jul 1, 2017:

Unless it's really a special case (and I don't think it is), there's no need to treat it in a special way.

No exception, no warning; there's nothing special about an empty corpus with regard to Phrases, except your new check for its first element. It's not a special case.

@piskvorky piskvorky changed the title [MRG] Phrases: make any2utf8 optional [WIP] Phrases: make any2utf8 optional Jul 6, 2017
**@piskvorky** (Owner) commented Jul 6, 2017

@prakhar2b can you please fix the two issues I pointed out last week (list and no special case)?
Let's get this PR over with finally.

```python
        self.delimiter = utils.to_unicode(self.delimiter)
        self.is_input_bytes = False
        sentences = it.chain([sentence], sentences)
    except:
```
Review comment (Contributor):

Using catch-all `except` blocks is generally a bad idea, since you could end up catching unexpected exceptions. So catch only the specific exception expected here (`StopIteration`?).
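One way to do that while keeping the re-chaining trick from the snippet above (a sketch; `peek` is a hypothetical helper, not the PR's code):

```python
import itertools as it

def peek(sentences):
    """Return (first_sentence, restored_iterable) without losing the first item."""
    iterator = iter(sentences)
    try:
        first = next(iterator)
    except StopIteration:  # only the expected exception, never a bare except
        return None, sentences
    return first, it.chain([first], iterator)

first, restored = peek(iter([[b'a'], [b'b']]))
restored_list = list(restored)
print(first, restored_list)  # first sentence, and nothing consumed for good
```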

```python
        self.is_input_bytes = False
        sentences = it.chain([sentence], sentences)
    except:
        # No need to raise any exception or log any warning, as it's not a special case.
```
Review comment (Contributor):

A debug message here would serve well too.

**@prakhar2b** (Contributor, Author)

I have a question regarding the behaviour of Phrases with bytestring infer input. Suppose we have:

```python
b_sent = [b'survey', b'user', b'computer', b'system', b'response', b'time']
sent = [u'survey', u'user', u'computer', u'system', u'response', u'time']
```

If we do `bigram[sent]`, we get the expected `['survey', 'user', 'computer_system', 'response', 'time']`.
However, `bigram[b_sent]` returns `<gensim.interfaces.TransformedCorpus object at 0x7f73a7d466a0>`.

I added this test (for bytestring infer input) for `recode_to_utf8=False`, but the behaviour is the same for `recode_to_utf8=True` as well. I find this behaviour difficult to understand. Therefore, before making any further changes, it would be better to know the POV of the person who implemented it: is this intended behaviour or a bug?
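One plausible cause, sketched under an assumption (this is not gensim's exact code): if `__getitem__` distinguishes "single sentence" from "corpus of sentences" by checking whether the first element is a text string, then on Python 3 a list of bytestring tokens fails that check, gets treated as a corpus, and a lazy `TransformedCorpus`-style wrapper is returned instead of a list:

```python
# Hypothetical single-sentence heuristic: is the first element a str?
def looks_like_single_sentence(obj):
    return bool(obj) and isinstance(obj[0], str)

sent = [u'survey', u'user', u'computer']
b_sent = [b'survey', b'user', b'computer']

print(looks_like_single_sentence(sent))    # unicode tokens: single sentence
print(looks_like_single_sentence(b_sent))  # bytes tokens: treated as a corpus
```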

cc @gojomo @piskvorky @jayantj

**@piskvorky** (Owner) commented Jul 17, 2017

Not sure how that comes about, but looks like a bug to me @prakhar2b (not intended behaviour).

@prakhar2b prakhar2b closed this Jul 29, 2017
@prakhar2b prakhar2b reopened this Jul 29, 2017
**@menshikh-iv** (Contributor)

What's the status, @prakhar2b?

@menshikh-iv menshikh-iv closed this Aug 8, 2017
**@piskvorky** (Owner) commented Aug 8, 2017

@menshikh-iv should I reopen my #1454 , since this was closed?

**@menshikh-iv** (Contributor)

@piskvorky I think not; Filip is now working on #1446, as I understand it.

**@piskvorky** (Owner)

@menshikh-iv #1446 is completely orthogonal to this: you can convert to utf8 with or without a memory-bounded counter.
